Effects of Community Structure on Search and Ranking in Information Networks

The World-Wide Web (WWW) is characterized by a strong community structure in which
communities of webpages (e.g. those sharing a common keyword) are densely interconnected by
hyperlinks. We study how such network architecture affects the average Google ranking of
individual webpages in the community. It is shown that the Google rank of community webpages
could either increase or decrease with the density of intra-community links, depending on the
exact balance between average in- and out-degrees in the community. The magnitude of this effect
is described by a simple analytical formula and subsequently verified by numerical simulations of
random scale-free networks with a desired level of community structure. A new algorithm for
generating such networks is proposed and studied. The number of intra-community links in such
networks is controlled by a temperature-like parameter, with the strongest community structure
realized in "low-temperature" networks.

The World Wide Web (WWW), a very large ($\sim 10^{10}$ nodes) network consisting of webpages
connected by hyperlinks, presents a challenge for efficient information retrieval and ranking.
Apart from the contents of webpages, the topology of the network itself can be a rich source of
information about their relative importance and relevance to the search query. It is the
effective utilization of this topological information [1] that advanced the
Google search engine to its present position of the most 
popular tool on the WWW and a profitable company 
with a current market capitalization around $30 billion. 
To rank the importance of webpages Google simulates 
the behavior of a large number of "random surfers" who 
just follow a randomly selected hyperlink on each page 
they visit. The number of hits a given page gets in the course of such a simulated process
determines its ranking. It is intuitively clear that the larger the number of hyperlinks
pointing to a given webpage (its in-degree in the network), the higher the chances of a random
surfer clicking on one of them and, therefore, the higher the resulting Google rank of this
webpage. However,
the algorithm goes beyond just ranking nodes based on 
their in-degrees. Indeed, the traffic directed to a given 
webpage along a particular incoming hyperlink is propor- 
tional to the popularity of the webpage containing this 
link. Therefore, the Google rank of a node is given by 
the weighted in-degree where the weight of each neigh- 
boring webpage reflects its importance and is determined 
self-consistently.

The WWW is a very heterogeneous collection of webpages which can be grouped based on their
textual contents, language in which they are written, the 
Internet Service Provider (ISP) where they are hosted, 
etc. Therefore, it should come as no surprise that the 
WWW has a strong community structure in which 
similar pages are more likely to contain hyperlinks to 
each other than to the outside world. Formally a web 
community can be defined as a collection of webpages 
characterized by a higher than average density of links 
connecting them to each other. In this letter we address the question of how the community
structure affects the Google rank of webpages inside the community. One might naively expect
that the community structure always boosts the Google rank of its webpages, as it tends to
"trap" the random surfer inside the community for a longer time. However, it turns out that
this is not generally true. In fact, the Google rank of community webpages could either increase
or decrease with the density of intra-community links, depending on the exact balance between
average in- and out-degrees in the community.

At the heart of the Google search engine lies
the PageRank algorithm determining the global "impor- 
tance" of every web page based on the link structure of 
the WWW network around it. While the details of the al- 
gorithm have undoubtedly changed since its introduction 
in 1997, the central "random surfer" idea first described 
in [1] has remained essentially the same. To a physicist, the
algorithm behind the PageRank just simulates an aux- 
iliary diffusion process taking place on the network in 
question. Similar diffusion algorithms have been recently 
applied to study citation and metabolic networks and 
the modularity of the Internet on the "hardware level" 
represented by an undirected network of interconnections 
between Autonomous Systems. A large number of random walkers are initially randomly distributed
on the network and are allowed to move along its directed links. Since in principle some nodes in
the network could have zero out-degree but non-zero in-degree and would thus "trap" random
walkers, the authors of the algorithm introduced a finite probability $\alpha$ for a surfer to
randomly select a page in the network and jump there directly without following any hyperlinks.
This leaves the probability $1 - \alpha$ for a surfer to randomly select and follow one of the
hyperlinks of the current webpage. According to [1], the original PageRank algorithm used
$\alpha = 0.15$. The PageRank algorithm then simulates this diffusion process until it converges
to a stationary distribution. The Google rank (PageRank) $G(i)$ of a node $i$ is proportional to
the number of random walkers at this node in such a steady state. We chose to normalize it so
that $\langle G(i) \rangle = 1$, but in general the normalization factor does not matter, as
ranking relies on relative values of $G(i)$ for different webpages. When one enters a search
keyword such as "statistical physics" on the Google website, the search engine first localizes
the subset of webpages containing this keyword and then simply presents them in descending order
of their PageRank values. The main equation determining the PageRank values $G(i)$ for all
webpages in the WWW is

$$G(i) = \alpha + (1 - \alpha) \sum_{j \to i} \frac{G(j)}{K_{\rm out}(j)} \qquad (1)$$

Here $K_{\rm out}(j)$ denotes the number of hyperlinks (the out-degree) of the node $j$, and the
summation goes over all nodes $j$ that have a hyperlink pointing to the node $i$. In the matrix
formalism the PageRank values are given by the components of the principal eigenvector of an
asymmetric positive matrix related to the adjacency matrix of the network. Such an eigenvector
can be easily found using a simple iterative algorithm. The fast convergence of this algorithm
is ensured by the fact that the adjacency matrix of the network is sparse.
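The iterative solution of Eq. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the production algorithm: the function name `pagerank`, the dictionary-based adjacency list `out_links`, the convergence tolerance and the four-page toy network are all assumptions made for this example, and every node is assumed to have at least one outgoing link.

```python
def pagerank(out_links, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate Eq. 1: G(i) = alpha + (1 - alpha) * sum_{j->i} G(j) / K_out(j).

    out_links maps every node to the list of nodes it links to.
    Assumes K_out > 0 for every node and that all targets are keys of out_links.
    """
    nodes = list(out_links)
    G = {i: 1.0 for i in nodes}  # start from the normalization <G> = 1
    for _ in range(max_iter):
        new_G = {i: alpha for i in nodes}  # contribution of random jumps
        for j, targets in out_links.items():
            share = (1.0 - alpha) * G[j] / len(targets)  # flow along each link of j
            for i in targets:
                new_G[i] += share
        change = max(abs(new_G[i] - G[i]) for i in nodes)
        G = new_G
        if change < tol:
            break
    return G


# Hypothetical toy network: pages 0, 1, 2 link in a cycle, page 3 links to page 0.
toy = {0: [1], 1: [2], 2: [0], 3: [0]}
ranks = pagerank(toy)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

Because every node passes on its full share $(1 - \alpha) G(j)$, each iteration preserves the normalization $\langle G(i) \rangle = 1$. A page without incoming links (page 3 in the toy network) ends up with the minimal rank $\alpha$.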
We first consider the effect of the community structure on Google ranking in the simplest and
most physically transparent case of $\alpha = 0$. In order for the algorithm to converge properly
in this case we need to assume that $K_{\rm out}(i) > 0$ for all nodes in the network. Consider a
network in which $N_c$ nodes form a community characterized by a higher than average density of
edges linking these nodes to each other. Let $E_{cw}$ denote the total number of hyperlinks
pointing from nodes in the community to the outside world, and $E_{wc}$ the total number of
hyperlinks pointing in the opposite direction (see Fig. 1 for an illustration). Similarly,
$E_{cc}$ and $E_{ww}$ denote the total numbers of links connecting nodes within the community
and within the outside world, respectively. The total number of hyperlinks pointing to nodes
inside the community is given by $E_{cc} + E_{wc} = N_c \langle K_{\rm in} \rangle_c$, where
$\langle K_{\rm in} \rangle_c$ is the average in-degree of community nodes. Similarly,
$E_{cc} + E_{cw} = N_c \langle K_{\rm out} \rangle_c$, where $\langle K_{\rm out} \rangle_c$ is
the average out-degree in the community, gives the total number of hyperlinks originating on
community nodes.

FIG. 1: Illustration of hyperlink connections between the community $C$ and the outside world
$W$. $E_{cw}$ and $E_{wc}$ are the numbers of links from the community to the outside world and
from the outside world to the community, respectively.

The Google rank is computed in the steady state of the diffusion process, where the average
number of random surfers currently visiting any given webpage does not change with time. This
means that the total current of surfers $J_{cw}$ leaving the community for the outside world
must be precisely balanced by the current $J_{wc}$ entering the community during the same time
interval. Let $G_c = \langle G(i) \rangle_{i \in C}$ denote the average Google rank inside the
community, given by the average number of random surfers on its nodes. If edges pointing away
from the community to the outside world start at an unbiased selection of nodes in the
community, the average current flowing along any of those edges is
$G_c / \langle K_{\rm out} \rangle_c$, while the total current leaving the community is
$J_{cw} = E_{cw} G_c / \langle K_{\rm out} \rangle_c$. A similar analysis gives
$J_{wc} = E_{wc} G_w / \langle K_{\rm out} \rangle_w$, where $G_w$ and
$\langle K_{\rm out} \rangle_w$ are the average Google rank and the average out-degree of nodes
in the world outside the community. Balancing these two currents one gets:

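$$\frac{G_c}{G_w} = \frac{E_{wc}}{E_{cw}} \, \frac{\langle K_{\rm out} \rangle_c}{\langle K_{\rm out} \rangle_w} \qquad (2)$$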

Eq. 2 is based on the "mean-field" assumption that the average values of the Google rank and the
out-degree on those community nodes that actually send links to the outside world are equal to
their overall average values inside the community. It is tempting to assume that a higher than
average density of hyperlinks connecting nodes in the community is beneficial for the Google
rank of its nodes, as it "traps" random surfers to spend more time within the community. It
turns out that this naive argument is not necessarily true. In fact, one is equally likely to
observe the opposite effect: an excess of intra-community links could lead to a lower than
average Google rank of its nodes. To see this explicitly one should replace $E_{wc}$ and
$E_{cw}$ in Eq. 2 with the equivalent expressions $\langle K_{\rm in} \rangle_c N_c - E_{cc}$
and $\langle K_{\rm out} \rangle_c N_c - E_{cc}$, respectively:

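$$\frac{G_c}{G_w} = \left( \frac{\langle K_{\rm in} \rangle_c N_c - E_{cc}}{\langle K_{\rm out} \rangle_c N_c - E_{cc}} \right) \frac{\langle K_{\rm out} \rangle_c}{\langle K_{\rm out} \rangle_w} \qquad (3)$$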
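As a quick numerical illustration, Eq. 3 can be evaluated for a range of $E_{cc}$. The sketch below uses the community size and average degrees quoted for the two series of model networks in Fig. 2; the outside-world out-degree `K_OUT_W = 5.0` is an assumed value chosen only for this example, and the function name `rank_ratio` is hypothetical.

```python
def rank_ratio(E_cc, N_c, k_in_c, k_out_c, k_out_w):
    """Eq. 3: G_c / G_w as a function of the number of intra-community links."""
    E_wc = k_in_c * N_c - E_cc   # links entering the community
    E_cw = k_out_c * N_c - E_cc  # links leaving the community
    return (E_wc / E_cw) * (k_out_c / k_out_w)


N_C = 500       # community size used in Fig. 2
K_OUT_W = 5.0   # assumed average out-degree of the outside world

cases = [
    ("<K_in>_c > <K_out>_c", 5.9, 5.24),
    ("<K_in>_c < <K_out>_c", 4.8, 5.6),
]
for label, k_in_c, k_out_c in cases:
    ratios = [rank_ratio(E_cc, N_C, k_in_c, k_out_c, K_OUT_W)
              for E_cc in (500, 1000, 1500, 2000)]
    print(label, [round(r, 3) for r in ratios])
```

In the first case the ratio grows with $E_{cc}$, in the second it falls, which is the sign change discussed in the text.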

From this equation it follows that enhancing the community structure (increasing $E_{cc}$)
while keeping other parameters such as $\langle K_{\rm in} \rangle_c$,
$\langle K_{\rm out} \rangle_c$ and $\langle K_{\rm out} \rangle_w$ fixed can be either good or
bad for the average Google rank of the community webpages. The outcome depends on
$\langle K_{\rm in} \rangle_c / \langle K_{\rm out} \rangle_c$, the ratio between average in-
and out-degrees of community nodes. If the ratio is less than 1, an increase in $E_{cc}$ leads
to a further decrease of $G_c / G_w$ below one. If the community constitutes just a small
fraction of the whole network, one can safely assume that $G_w$ remains approximately constant,
so that the average Google rank of the community, $G_c$, has to decrease. Similarly, if the
ratio is larger than 1, $G_c$ grows with the number of intra-community links $E_{cc}$ (see
Fig. 2 for an illustration of both cases).

FIG. 2: The ratio of average Google ranks in the community and the outside world, $G_c / G_w$,
as a function of $E_{cc}$, the number of intra-community links, in two series of model networks
with varying degree of community structure. Open circles correspond to a beneficial effect of
the community structure on Google ranking in a scale-free network with
$\langle K_{\rm out} \rangle_c = 5.24 < \langle K_{\rm in} \rangle_c = 5.9$. Filled squares show
a detrimental effect in another series of networks where
$\langle K_{\rm out} \rangle_c \simeq 5.6 > \langle K_{\rm in} \rangle_c \simeq 4.8$. Solid
lines are fits with Eq. 3 with a given set of parameters for each of the networks. All networks
have 10,000 nodes with a community of 500 nodes and were generated by the Metropolis rewiring
algorithm described later in the text.

The real-life Google algorithm uses a non-zero value of $\alpha \simeq 0.15$. In this case one
needs to consider the contributions to the currents $J_{cw}$ and $J_{wc}$ due to surfers' random
jumps that do not follow the existing hyperlinks. The total number of random walkers residing on
the nodes inside the community is $G_c N_c$, and the probability for each of them to randomly
jump to a node in the outside world is $N_w / (N_c + N_w)$. So the contribution to the outgoing
current due to such jumps is given by $\alpha G_c N_c N_w / (N_c + N_w)$, which for
$N_c \ll N_w$ can be simplified to $\alpha G_c N_c$. The total outgoing current can then be
written as $J_{cw} = (1 - \alpha) G_c E_{cw} / \langle K_{\rm out} \rangle_c + \alpha G_c N_c$.
Similarly, the incoming current $J_{wc}$ is given by
$(1 - \alpha) G_w E_{wc} / \langle K_{\rm out} \rangle_w + \alpha G_w N_c$. Eq. 2 remains valid
for $\alpha > 0$ if one replaces $E_{wc}$ and $E_{cw}$ with the "effective" numbers of edges
$E^*_{wc}$ and $E^*_{cw}$ given by

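$$\begin{aligned}
E^*_{cw} &= E_{cw} (1 - \alpha) + N_c \langle K_{\rm out} \rangle_c \, \alpha \, ; \\
E^*_{wc} &= E_{wc} (1 - \alpha) + N_c \langle K_{\rm out} \rangle_w \, \alpha \, .
\end{aligned} \qquad (4)$$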
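The effect of random jumps in Eq. 4 can be illustrated with a short sketch that inserts the effective link counts into Eq. 2. The function names and the numerical values below (a 500-node community with 1620 outgoing and 1950 incoming links, and an outside-world out-degree of 5.0) are hypothetical choices for this example only.

```python
def effective_edges(E_cw, E_wc, N_c, k_out_c, k_out_w, alpha=0.15):
    """Eq. 4: effective link counts that include the random jumps."""
    E_cw_eff = E_cw * (1.0 - alpha) + N_c * k_out_c * alpha
    E_wc_eff = E_wc * (1.0 - alpha) + N_c * k_out_w * alpha
    return E_cw_eff, E_wc_eff


def rank_ratio_with_jumps(E_cw, E_wc, N_c, k_out_c, k_out_w, alpha=0.15):
    """Eq. 2 with E_cw and E_wc replaced by their effective values."""
    E_cw_eff, E_wc_eff = effective_edges(E_cw, E_wc, N_c, k_out_c, k_out_w, alpha)
    return (E_wc_eff / E_cw_eff) * (k_out_c / k_out_w)


# Hypothetical community used only for illustration.
params = dict(E_cw=1620, E_wc=1950, N_c=500, k_out_c=5.24, k_out_w=5.0)
no_jumps = rank_ratio_with_jumps(alpha=0.0, **params)
with_jumps = rank_ratio_with_jumps(alpha=0.15, **params)
only_jumps = rank_ratio_with_jumps(alpha=1.0, **params)
print(no_jumps, with_jumps, only_jumps)
```

Random jumps pull the ratio toward 1: in the limit $\alpha = 1$ the walkers ignore the links entirely and the predicted ratio $G_c / G_w$ equals 1.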

These effective numbers take into account contributions to both currents due to random jumps.

For a numerical test of the validity of our analytical results we generated an ensemble of
directed networks with scale-free distributions of in- and out-degrees:
$P(K_{\rm in}) \sim K_{\rm in}^{-\gamma_{\rm in}}$ and
$P(K_{\rm out}) \sim K_{\rm out}^{-\gamma_{\rm out}}$, respectively. The exponents were selected
to be identical to their values in the actual WWW network [6, 7]. The community structure in
those networks was artificially created using the Metropolis rewiring algorithm described
below. As a result, a pre-selected group of $N_c$ nodes formed an artificial community with the
exact number of intra-community links controlled by the parameters of our simulation. Fig. 3
shows the results of a numerical test of Eq. 3 in those model networks.

FIG. 3: The ratio of average Google ranks in the community and the outside world, $G_c / G_w$,
as a function of the ratio of effective numbers of links $E^*_{wc} / E^*_{cw}$. As predicted by
Eq. 2, these two ratios are basically equal to each other. Different symbols correspond to the
series of networks described in Fig. 2.

For numerical studies of networks with a community structure one needs an efficient algorithm
to generate them. In this work we propose
a version of the Metropolis random rewiring algorithm 
introduced earlier in [8]. The algorithm starts from a "seed" network with the desired
(scale-free in our case) distributions of in- and out-degrees. Such a seed network can be
created e.g. using a stub reconnection procedure described in [9]. The heart of our algorithm
is the local rewiring (edge switching) step, which strictly conserves separately the in- and
out-degrees of every node involved [10]. The only parameters of the Metropolis part of our
algorithm are an auxiliary Hamiltonian (energy function) $H = -E_{cc}$, defined as the negative
of the number of intra-community links, and the inverse temperature $\beta$. The
steps of the algorithm are as follows: 1) Randomly pick two links, say $A \to B$ and
$C \to D$; 2) Attempt to rewire them (switch their neighbors) to $A \to D$ and $C \to B$. If at
least one of these two new links already exists in the network, abort this step and go back to
step 1; 3) If the rewiring step decreases the Hamiltonian $H$, it is always accepted, while if
it increases the Hamiltonian by $\Delta H$, it is accepted only with probability
$\exp(-\beta \Delta H)$. If the rewiring step is rejected in step 2 or 3, the network is
returned to the original configuration $A \to B$ and $C \to D$; 4) Repeat the above steps until
the number of links inside the community, $E_{cc}$, reaches a steady-state value. The
reciprocal temperature $\beta$ thus indirectly determines the number of links within the
community, $E_{cc}(\beta)$, so that an ordinary random (scale-free) network without any
community structure is realized at an "infinite temperature" ($\beta = 0$), while the algorithm
run at zero temperature ($\beta = \infty$) produces a network with the largest possible number
of links within the community. One could also invert the sign in the definition of the
Hamiltonian, $H = E_{cc}$. Formally this can be thought of as running the algorithm with the
original Hamiltonian but a negative inverse temperature $\beta < 0$. Large negative values of
$\beta$ generate networks with an anti-community structure, in which the number of
intra-community links is lower than in a random network. The relation between $E_{cc}$ and
$\beta$ for both positive and negative values of $\beta$ is shown in Fig. 4.

FIG. 4: The number of intra-community links $E_{cc}$ in networks generated by the rewiring
algorithm as a function of the inverse temperature $\beta$. Negative values of $\beta$
correspond to networks with anti-community structure and are generated by changing the sign in
front of the Hamiltonian $H$. The solid line is the fit with the analytical expression obtained
by solving Eq. 5 for $E_{cc}$. The inset shows the same plot with a logarithmic scale of the
Y-axis.

To analytically derive the relation between $E_{cc}$ and the reciprocal temperature $\beta$, we
consider the detailed balance in the steady state of the rewiring procedure, in which the
probabilities of an increase and a decrease in $E_{cc}$ must be equal to each other. $E_{cc}$
is increased by 1 if the links picked at a given step of the rewiring algorithm are $C \to W$
and $W \to C$ (here $C$ stands for any node inside the community and $W$ for any node in the
outside world). The probability to pick such a pair is proportional to $E_{cw} E_{wc}$. On the
other hand, if the selected links are $C \to C$ and $W \to W$, which happens with a probability
proportional to $E_{cc} E_{ww}$, the number of links in the community decreases by one with a
probability $\exp(-\beta)$. All other selections of links do not change $E_{cc}$. The detailed
balance equation for the rewiring procedure thus reads:

$$E_{cw} E_{wc} = E_{cc} E_{ww} \, e^{-\beta} \qquad (5)$$
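Eq. 5 can be solved for $E_{cc}$ once the other three link counts are eliminated. The sketch below assumes the edge-count identities $E_{wc} = A - E_{cc}$, $E_{cw} = B - E_{cc}$ and $E_{ww} = E - A - B + E_{cc}$, where $A = N_c \langle K_{\rm in} \rangle_c$, $B = N_c \langle K_{\rm out} \rangle_c$ and $E$ is the total number of links. The shorthand $A$, $B$, the function name `intra_links` and the value $E = 50000$ are assumptions introduced for this illustration.

```python
import math


def intra_links(beta, A, B, E):
    """Physical root of (B - x)(A - x) = x (E - A - B + x) exp(-beta) for x = E_cc.

    A = N_c <K_in>_c, B = N_c <K_out>_c, E = total number of links.
    Written as 2AB / (b + sqrt(disc)) so that beta = 0 needs no special case.
    """
    q = math.exp(-beta)
    b = (A + B) + q * (E - A - B)
    disc = b * b - 4.0 * (1.0 - q) * A * B
    return 2.0 * A * B / (b + math.sqrt(disc))


# Community of Fig. 2 (500 nodes, <K_in>_c = 5.9, <K_out>_c = 5.24);
# the total number of links E is an assumed value.
A, B, E = 2950.0, 2620.0, 50000.0
for beta in (-2.0, 0.0, 2.0, 5.0, 50.0):
    print(beta, round(intra_links(beta, A, B, E), 1))
```

At $\beta = 0$ the solution reduces to the random-network expectation $AB/E$, and for large $\beta$ it saturates at the smaller of $A$ and $B$, since $E_{cc}$ can exceed neither the total in-degree nor the total out-degree of the community.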

Additional constraints (i) $E_{cc} + E_{wc} = \langle K_{\rm in} \rangle_c N_c$ (the sum of
in-degrees of all nodes within the community), (ii)
$E_{cc} + E_{cw} = \langle K_{\rm out} \rangle_c N_c$ (the sum of out-degrees of all nodes
within the community) and (iii) $E_{cc} + E_{cw} + E_{wc} + E_{ww} = E$ (the total number of
edges in the network) plugged into Eq. 5 result in a quadratic equation for $E_{cc}$ as a
function of $\beta$ and of $\langle K_{\rm in} \rangle_c$, $\langle K_{\rm out} \rangle_c$ and
$E$, the parameters strictly conserved in our rewiring algorithm. Fig. 4 compares the
analytical expression for $E_{cc}(\beta)$ obtained by solving Eq. 5 with numerical simulations
for different values of $\beta$. Clearly, $E_{cc}$ increases with $\beta$ in general accord
with Eq. 5. When $\beta$ is sufficiently large, $E_{cc}$ exponentially approaches a limiting
value equal to $\min(\langle K_{\rm in} \rangle_c, \langle K_{\rm out} \rangle_c) N_c$, the
maximal number of links within a community given the set of in- and out-degrees of its nodes.
The deviations between the analytical formula and numerical results visible for large values
of $\beta$ could be attributed to the "no multiple edges" restriction in networks generated by
our rewiring algorithm. As the density of intra-community links increases with $\beta$, more
and more of the rewiring steps leading to an increase of $E_{cc}$ have to be aborted, as the
new link they are attempting to create within the community already exists. This situation is
more appropriately described by the following equation:
$E_{cw} E_{wc} (1 - E_{cc}/E)(1 - E_{ww}/E) = E_{cc} E_{ww} (1 - E_{cw}/E)(1 - E_{wc}/E) \, e^{-\beta}$,
reminiscent of the detailed balance equation in two-fermion scattering (see also [11]).
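The Metropolis rewiring steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the seed network is a plain random directed graph instead of a scale-free one, a fixed number of steps replaces the "repeat until steady state" criterion, self-loops are rejected (a restriction not stated in the text), and all function names and parameter values are hypothetical. The acceptance rule is written as $\min(1, e^{-\beta \Delta H})$, so the same code also handles negative $\beta$.

```python
import math
import random


def random_seed_network(n_nodes, n_edges, seed=0):
    """Toy seed network: random directed links, no self-loops, no duplicates."""
    rng = random.Random(seed)
    present = set()
    while len(present) < n_edges:
        a, b = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if a != b:
            present.add((a, b))
    return sorted(present)


def rewire(edges, community, beta, n_steps, seed=0):
    """Metropolis edge switching with H = -E_cc; returns (edges, E_cc)."""
    rng = random.Random(seed)
    edges = list(edges)
    present = set(edges)

    def intra(a, b):
        return a in community and b in community

    e_cc = sum(intra(a, b) for a, b in edges)
    for _ in range(n_steps):
        # Step 1: pick two distinct links A->B and C->D.
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        # Step 2: abort if a new link already exists
        # (rejecting self-loops is an extra assumption of this sketch).
        if a == d or c == b or new1 in present or new2 in present:
            continue
        # Step 3: Metropolis acceptance, Delta H = -(change in E_cc).
        delta_h = intra(a, b) + intra(c, d) - intra(a, d) - intra(c, b)
        x = -beta * delta_h
        if x < 0 and rng.random() >= math.exp(x):
            continue
        present.difference_update((edges[i], edges[j]))
        present.update((new1, new2))
        edges[i], edges[j] = new1, new2
        e_cc -= delta_h
    return edges, e_cc


seed_edges = random_seed_network(200, 1000, seed=1)
community = set(range(20))  # hypothetical community: the first 20 nodes
hot_edges, e_hot = rewire(seed_edges, community, beta=0.0, n_steps=50000, seed=2)
cold_edges, e_cold = rewire(seed_edges, community, beta=5.0, n_steps=50000, seed=2)
print("E_cc at beta=0:", e_hot, " E_cc at beta=5:", e_cold)
```

Each accepted switch only exchanges the targets of two links, so the in- and out-degree of every node is conserved, while the number of intra-community links is driven up at large $\beta$.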

In summary, we investigated how the WWW com- 
munity structure affects the Google rank of webpages 
belonging to a given community. We have shown that 
depending on the balance between average in- and out- 
degrees of webpages inside the community the excess den- 
sity of intra-community hyperlinks can either boost or de- 
crease the average Google ranking of its webpages. For 
numerical studies of scale-free networks with a commu- 
nity structure we developed a version of the Metropolis 
rewiring algorithm first proposed by one of us in [8]. This
algorithm allows one to generate a random network with 
a desired density of intra-community links and a given 
distribution of in- and out-degrees.
of Material Science, U.S. Department of Energy.